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Abstract 

We study a method of seed-based lossless filtration for approximate string matching and related 
bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather 
than a single seed as studied by Burkhardt and Karkkainen [:1|. We present algorithms to compute 
several important parameters of seed families, study their combinatorial properties, and describe several 
techniques to construct efficient families. We also report a large-scale application of the proposed 
technique to the problem of oligonucleotide selection for an EST sequence database. 
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I. Introduction 

FILTERING is a widely-used technique in biosequence analysis. Applied to the approximate 
string matching problem , it can be summarized by the following two-stage scheme: to 
find approximate occurrences (matches) of a given string in a sequence (text), one first quickly 
discards (filters out) those sequence regions where matches cannot occur, and then checks out 
the remaining parts of the sequence for actual matches. The filtering is done according to small 
patterns of a specified form that the searched string is assumed to share, in the exact way, with 
its approximate occurrences. A similar filtration scheme is used by heuristic local alignment 
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algorithms (0, 0, 0, 0, to mention a few): they first identify potential similarity regions 
that share some patterns and then actually check whether those regions represent a significant 
similarity by computing a corresponding alignment. 

Two types of filtering should be distinguished - lossless and lossy. A lossless filtration 
guarantees to detect all sequence fragments under interest, while a lossy filtration may miss 
some of them, but still tries to detect a majority of them. Local alignment algorithms usually 
use a lossy filtration. On the other hand, the lossless filtration has been studied in the context of 
approximate string matching problem Q, 0. In this paper, we focus on the lossless filtration. 

In the case of lossy filtration, its efficiency is measured by two parameters, usually called 
selectivity and sensitivity. The sensitivity measures the part of sequence fragments of interest 
that are missed by the filter (false negatives), and the selectivity indicates what part of detected 
candidate fragments don't actually represent a solution (false positives). In the case of lossless 
filtration, only the selectivity parameter makes sense and is therefore the main characteristic of 
the filtration efficiency. 

The choice of patterns that must be contained in the searched sequence fragments is a key 
ingredient of the filtration algorithm. Gapped seeds (spaced seeds, gapped g-grams) have been 
recently shown to significantly improve the filtration efficiency over the "traditional" technique 
of contiguous seeds. In the framework of lossy filtration for sequence alignment, the use of 
designed gapped seeds has been introduced by the PatternHunter method and then used 
by some other algorithms (e.g. 0, 0). In 0, 0, spaced seeds have been shown to improve 
indexing schemes for similarity search in sequence databases. The estimation of the sensitivity 
of spaced seeds (as well as of some extended seed models) has been subject of several recent 
studies OH, 0J], H2), El, OH, ffUl. In the framework of lossless filtration for approximate 
pattern matching, gapped seeds were studied in (see also 0) and have also been shown to 
increase the filtration efficiency considerably. 

In this paper, we study an extension of the lossless single-seed filtration technique 0. The 
extension is based on using seed families rather than individual seeds. The idea of simultaneous 
use of multiple seeds for DNA local alignment was already envisaged in and applied in 
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PatternHunter II software [U6ll . The problem of designing efficient seed families has also 
been studied in ifTTll . In lfT8l . multiple seeds have been applied to the protein search. However, 
the issues analysed in the present paper are quite different, due to the proposed requirement for 
the search to be lossless. 

The rest of the paper is organized as follows. After formally introducing the concept of multiple 
seed filtering in Section [III Section [III] is devoted to dynamic programming algorithms to compute 
several important parameters of seed families. In Section [TV] we first study several combinatorial 
properties of families of seeds, and, in particular, seeds having a periodic structure. These results 
are used to obtain a method for constructing efficient seed families. We also outline a heuristic 
genetic programming algorithm for constructing seed families. Finally, in Section |V] we present 
several seed families we computed, and we report a large-scale experimental application of the 
method to a practical problem of oligonucleotide selection. 

II. Multiple seed filtering 

A seed Q (called also spaced seed or gapped q-gram) is a list {pi,p2, ■ ■ ■ ,Pd} of positive 
integers, called matching positions, such that p\ < p 2 < ■ ■ ■ < Pd- By convention, we always 
assume p\ = 0. The span of a seed Q, denoted s(Q), is the quantity pa + 1. The number d of 
matching positions is called the weight of the seed and denoted w(Q). Often we will use a more 
visual representation of seeds, adopted in [1], as words of length s(Q) over the two-letter alphabet 
{#,-}, where # occurs at all matching positions and - at all positions in between. For example, 
seed {0,1,2,4,6,9,10,11} of weight 8 and span 12 is represented by word ###-#-# — ###. 
The character - is called a joker. Note that, unless otherwise stated, the seed has the character 
# at its first and last positions. 

Intuitively, a seed specifies the set of patterns that, if shared by two sequences, indicate a 
possible similarity between them. Two sequences are similar if the Hamming distance between 
them is smaller than a certain threshold. For example, sequences CACTCGT and CACACTT are 
similar within Hamming distance 2 and this similarity is detected by the seed ##-# at position 
2. We are interested in seeds that detect all similarities of a given length with a given Hamming 
distance. 
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Formally, a gapless similarity (hereafter simply similarity) of two sequences of length m is 
a binary word w E {0, l} m interpreted as a sequence of matches (l's) and mismatches (0's) of 
individual characters from the alphabet of input sequences. A seed Q = {pi,p2, ■ ■ ■ ,Pd} matches 
a similarity w at position i, 1 < % < m — pa + 1, iff for every j E [l..d], we have w[i + Pj] = 1. 
In this case, we also say that seed Q has an occurrence in similarity w at position i. A seed Q 
is said to detect a similarity w if Q has at least one occurrence in w. 

Given a similarity length m and a number of mismatches k, consider all similarities of length 
m containing k 0's and (m — k) l's. These similarities are called (m, /^-similarities. A seed 
Q solves the detection problem (m, k) (for short, the (m, /c)-problem) iff all of (™) (m, k)- 
similarities w are detected by Q. For example, one can check that seed #-##--#-## solves 
the (15, 2) -problem. 

Note that the weight of the seed is directly related to the selectivity of the corresponding 
filtration procedure. A larger weight improves the selectivity, as less similarities will pass through 
the filter. On the other hand, a smaller weight reduces the filtration efficiency. Therefore, the 
goal is to solve an (m, A;)-problem by a seed with the largest possible weight. 

Solving (m, A;)-problems by a single seed has been studied by Burkhardt and Karkkainen fl}. 
An extension we propose here is to use a family of seeds, instead of a single seed, to solve the 
(m, A;)-problem. Formally, a finite family of seeds F =< Qi >f =t solves an (m, k)-problem iff 
for any (m, k) -similarity w, there exists a seed Qi E F that detects w. 

Note that the seeds of the family are used in the complementary (or disjunctive) fashion, i.e. 
a similarity is detected if it is detected by one of the seeds. This differs from the conjunctive 
approach of [7| where a similarity should be detected by two seeds simultaneously. 

The following example motivates the use of multiple seeds. In (U, it has been shown that a seed 
solving the (25, 2)-problem has the maximal weight 12. The only such seed (up to reversal) is 
###-# — ###-# — ###-#. However, the problem can be solved by the family composed of the 
following two seeds of weight 14: #####-## — #####-## and #-## — #####-## — ####. 

Clearly, using these two seeds increases the selectivity of the search, as only similarities having 
14 or more matching characters pass the filter vs 12 matching characters in the case of single 



5 

seed. On uniform Bernoulli sequences, this results in the decrease of the number of candidate 
similarities by the factor of |A| 2 /2, where A is the input alphabet. This illustrates the advantage 
of the multiple seed approach: it allows to increase the selectivity while preserving a lossless 
search. The price to pay for this gain in selectivity is multiplying the work on identifying the 
seed occurrences. In the case of large sequences, however, this is largely compensated by the 
decrease in the number of false positives caused by the increase of the seed weight. 

III. Computing properties of seed families 

Burkhardt and Karkkainen HI proposed a dynamic programming algorithm to compute the 
optimal threshold of a given seed - the minimal number of its occurrences over all possible 
(m, k)- similarities. In this section, we describe an extension of this algorithm for seed families 
and, on the other hand, describe dynamic programming algorithms for computing two other 
important parameters of seed families that we will use in a later section. 

Consider an (m, k) -problem and a family of seeds F =< Qi >f =1 . We need the following 
notation. 

• s max = max{s(Qi)}f =1 , s min = min{s(Qi)}f =1 , 

• for a binary word w and a seed Q h suff(Q h w) = 1 if Qi matches w at position (\w\— s(Qi)+l) 
(i.e. matches a suffix of w), otherwise suff(Qi,w)=0, 

• last(w) = 1 if the last character of w is 1, otherwise last(w) = 0, 

• zeros(w) is the number of 0's in w. 

A. Optimal threshold 

Given an (m, /c)-problem, a family of seeds F =< Qi >f =1 has the optimal threshold Tp(m, k) 
if every (m, k) -similarity has at least Tp(m, k) occurrences of seeds of F and this is the maximal 
number with this property. Note that overlapping occurrences of a seed as well as occurrences 
of different seeds at the same position are counted separately. For example, the singleton family 
{###-##} has threshold 2 for the (15, 2) -problem. 

Clearly, F solves an (m, /c)-problem if and only if T F (m, k) > 0. If T F (m, k) > 1, then one 
can strengthen the detection criterion by requiring several seed occurrences for a similarity to 



T F (i,j,w[l..n}) 



be detected. This shows the importance of the optimal threshold parameter. 

We now describe a dynamic programming algorithm for computing the optimal threshold 
T F (m, k). For a binary word w, consider the quantity T F (m, k, w) defined as the minimal number 
of occurrences of seeds of F in all (m, k) -similarities which have the suffix w. By definition, 

Tp(m, k) = T F (m, k, e). Assume that we precomputed values T F (j, w) = T F (s max , j, w), for all 
j < raax{k, s max }, \w\ = s max . The algorithm is based on the following recurrence relations on 

T F (i,j,w), for i > 

( 

T F (j,w), if i = s maxj 

T F (i-l,j-l,w[l..n-l}), if w[n]=0, 

T F (i-l,j,w[l..n-l)) + \^2f =1 suff(Qi,w)], if n=s max , 
mm{T F (i,j, l.w),T F (i,j, O.w)}, if zeros(w) <j, 

T F (i,j,l.w), if zerosiw) —j. 

The first relation is an initial condition of the recurrence. The second one is based on the fact 
that if the last symbol of w is 0, then no seed can match a suffix of w (as the last position of 
a seed is always assumed to be a matching position). The third relation reduces the size of the 
problem by counting the number of suffix seed occurrences. The fourth one splits the counting 
into two cases, by considering two possible characters occurring on the left of w. If w already 
contains j 0's, then only 1 can occur on the left of w, as stated by the last relation. 

A dynamic programming implementation of the above recurrence allows to compute 
T F (m, k, e) in a bottom-up fashion, starting from initial values T F (j, w) and applying the above 
relations in the order in which they are given. A straightforward dynamic programming im- 
plementation requires 0{m ■ k ■ 2^ Smax+l ' ) ) time and space. However, the space complexity can 
be immediately improved: if values of i are processed successively, then only 0(k ■ 2^ Smax+1 ') 
space is needed. Furthermore, for each i and j, it is not necessary to consider all 2^ Smax+1 ^ 
different strings w, but only those which contain up to j 0's. The number of those w is 
g{j,s max ) = J2l=o ( Sm f ? x ). For each i, j ranges from to k. Therefore, for each i, we need 
to store f(k,s max ) = Y!}=o9(h s ma X ) = Y!}=o (''7") " ~j + 1) values. This yields the same 
space complexity as for computing the optimal threshold for one seed fl}. 



The quantity Yli=i su ff{Qu w) can be precomputed for all considered words w in time 0(L ■ 
g(k, s m ax)) and space 0(g(k, s max )), under the assumption that checking an individual match is 
done in constant time. This leads to the overall time complexity 0(m- f(k, s max ) + L- g(k, s max )) 
with the leading term m • f(k, s max ) (as L is usually small compared to m and g(k, s max ) is 
smaller than f(k, s max )). 

B. Number of undetected similarities 

We now describe a dynamic programming algorithm that computes another characteristic of a 
seed family, that will be used later in Section HV-DI Consider an (m, A;)-problem. Given a seed 
family F =< Qi >fL 1 , we are interested in the number U F (m, k) of (m, /^-similarities that are 
not detected by F. For a binary word w, define U F (m, k,w) to be the number of undetected 
(m, k) -similarities that have the suffix w. 

Similar to |[T0l , let X(F) be the set of binary words w such that (i) \w\ < s max , (ii) for any 
Qi e F, suff{Q h \ s ™*-\ w \ w ) = 0, and (Hi) no proper suffix of w satisfies (ii). Note that word 
belongs to X(F), as the last position of every seed is a matching position. 

The following recurrence relations allow to compute U F (i,j,w) for i < m, j < k, and 

|^| ^ Smax- 

/ 

(j— zeros(w)) ' if Z *C >S m j n , 

0, iflle[l..L],suff(Q l ,w) = l, 

U F (i - l,j - last(w),w[l..n - 1]), if w e X(F), 
U F (i,j, l.w) + U F (i,j, O.w), if zeros(w) < j, 

Up(i, j, if zeros[w) = j. 

The first condition says that if i < s min , then no word of length i will be detected, hence the 
binomial coefficient. The second condition is straightforward. The third relation follows from 
the definition of X(F) and allows to reduce the size of the problem. The last two conditions 
are similar to those from the previous section. 

The set X(F) can be precomputed in time 0(L-g(k, s max )) and the worst-case time complexity 
of the whole algorithm remains 0(m ■ f(k, s max ) + L ■ g(k, s max )). 



U F (i,j,w[l..n]) 



C. Contribution of a seed 

Using a similar dynamic programming technique, one can compute, for a given seed of the 
family, the number of (m, k) -similarities that are detected only by this seed and not by the 
others. Together with the number of undetected similarities, this parameter will be used later in 
Section HV-Dl 

Given an (m, /c)-problem and a family F =< Qi >f =1 , we define S F (m, k, I) to be the number 
of (m, k) -similarities detected by the seed Qi exclusively (through one or several occurrences), 
and S F (m, k,l,w) to be the number of those similarities ending with the suffix w. A dynamic 
programming algorithm similar to the one described in the previous sections can be applied to 
compute S F (m, k, I). The recurrence is given below. 



S F (i,j,l,w[l..n\) 



if i<s min or 31'^ I suff{Qi>,w) = l 

S F {i-l,j-l,l,w[l..n-l}) if w[n] = 



S F (i-l,j,l,w[l..n-l]) 
S F (i-l,j,l,w[l..n-l)) 



ifn= \Qi\ and suff(Q h w) = 
if n = s max and suff(Q h w) = 1 



+ U F {i-l,j,w[l..n-l}) and V/' ^ l,suff(Q v ,w) = 0, 
S F (i,j,l, l.w[l..n\) 



if zeros(w) < j 
if zeros(w ) = j 



+ S F (i,j,l,0.w[l..n]) 
S F (i,j,l, l.u>[l..n]) 

The third and fourth relations play the principal role: if Qi does not match a suffix of iw[l..n], 
then we simply drop out the last letter. If Qi matches a suffix of iu[l..n], but no other seed does, 
then we count prefixes matched by Qi exclusively (term S F (i — 1, j, I, w[l..n — 1])) together with 
prefixes matched by no seed at all (term U F (i — l,j,w[l..n — 1])). The latter is computed by 
the algorithm of the previous section. 

The complexity of computing S F (m,k,l) for a given / is the same as the complexity of 
dynamic programming algorithms from the previous sections. 



9 

IV. Seed design 

In the previous Section we showed how to compute various useful characteristics of a given 
family of seeds. A much more difficult task is to find an efficient seed family that solves a given 
(m, £;) -problem. Note that there exists a trivial solution where the family consists of all (™) 
position combinations, but this is in general unacceptable in practice because of a huge number 
of seeds. Our goal is to find families of reasonable size (typically, with the number of seeds 
smaller than ten), with a good filtration efficiency. 

In this section, we present several results that contribute to this goal. In Section |IV-A[ we start 
with the case of single seed with a fixed number of jokers and show, in particular, that for one 
joker, there exists one best seed in a sense that will be defined. We then show in Section IIV-BI 
that a solution for a larger problem can be obtained from a smaller one by a regular expansion 
operation. In Section IIV-CI we focus on seeds that have a periodic structure and show how those 
seeds can be constructed by iterating some smaller seeds. We then show a way to build efficient 
families of periodic seeds. Finally, in Section IIV-DI we briefly describe a heuristic approach to 
constructing efficient seed families that we used in the experimental part of this work presented 
in Section |V] 

A. Single seeds with a fixed number of jokers 

Assume that we fixed a class of seeds under interest (e.g. seeds of a given minimal weight). 
One possible way to define the seed design problem is to fix a similarity length m and find 
a seed that solves the (m, A;)-problem with the largest possible value of k. A complementary 
definition is to fix k and minimize m provided that the (m, A;) -problem is still solved. In this 
section, we adopt the second definition and present an optimal solution for one particular case. 

For a seed Q and a number of mismatches k, define the k-critical length for Q as the minimal 
value m such that Q solves the (m, fc)-problem. For a class of seeds C and a value k, a seed is 
fc-optimal in C if Q has the minimal ^-critical length among all seeds of C. 

One interesting class of seeds C is obtained by putting an upper bound on the possible number 
of jokers in the seed, i.e. on the number (s(Q) — w(Q)). We have found a general solution of 
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the seed design problem for the class Ci(n) consisting of seeds of weight d with only one joker, 
i.e. seeds # d ~ r -# r . 

Consider first the case of one mismatch, i.e. k = 1. A 1-optimal seed from C\{d) is # d ~ r -# r 
with r = |_cZ /2 J . To see this, consider an arbitrary seed Q = p + q = d, and assume 

by symmetry that p > q. Observe that the longest (m, l)-similarity that is not detected by Q 
is p _1 0P +9 of length (2p + q). Therefore, we have to minimize 2p + q = d + p, and since 
P > \d/2], the minimum is reached for p = \d/2], q = [d/2\. 

However, for k > 2, an optimal seed has an asymmetric structure described by the following 
theorem. 

Theorem 1: Let n be an integer and r = [d/3] ([x] is the closest integer to x). For every 
k > 2, seed Q(d) = # d ~ r -# r is A;-optimal among the seeds of C\(d). 

Proof: Again, consider a seed Q = # p -# c/ , p + q = d, and assume that p > q. Consider 
the longest word S(k) from (l*0) fc l*, k > 1, which is not detected by Q and let L(k) is the 
length of S(k). By the above remark, S(l) = P^OP^ and L(l) = 2p + q. 

It is easily seen that for every k, S(k) starts either with P _1 0, or with P^OP^O. Define 
L'{k) to be the maximal length of a word from (l*0) fc l* that is not detected by Q and starts 
with P^O. Since prefix P^O implies no additional constraint on the rest of the word, we have 
L'{k) = q + L{k — 1). Observe that L'{\) = p + 2q (word P -1 0P +l? ). To summarize, we have 
the following recurrences for k > 2: 

L'{k) = q + L(k-l), (1) 
L(k) = max{p + L(k - l),p + q + 1 + L'(k-l)}, (2) 

with initial conditions L'(l) = p + 2q, L{1) = 2p + q. 

Two cases should be distinguished. If p > 2q + 1, then the straightforward induction shows 
that the first term in © is always greater, and we have 

L(k) = (k + l)p + q, (3) 

and the corresponding longest word is 

S(k) = (l p - 1 0) k l p+q . (4) 



1 1 



If q < p < 2q + 1, then by induction, we obtain 



(£ + l)p+(k + l)q + £ 

L(k) = 

\(£ + 2)p + kq + £ 



if k 



if k 



2£, 



2£+l, 



(5) 



and 



S(k) 




if k 



21, 



(6) 



l p_1 0(l p+(? 01 9_1 0) £ l p+9 



if k 



2£ + l. 



By definition of L(k), seed # p -# 9 detects any word from (l*0) fe l* of length (L(k) + 1) or 
more, and this is the tight bound. Therefore, we have to find p, q which minimize L{k). Recall 
that p + q = d, and observe that for p > 2q+ 1, L{k) (defined by ©) is increasing on p, while 
for p < 2q + 1, L{k) (defined by ©) is decreasing on p. Therefore, both functions reach its 
minimum when p = 2q+ 1. Therefore, if d = 1 (mod 3), we obtain q = [d/3\ and p = d — q. If 
<i = (mod 3), a routine computation shows that the minimum is reached at q = d/3, p = 2d/3, 
and if d = 2 (mod 3), the minimum is reached at q = \d/3] , p = d — q. Putting the three cases 
together results in q — [d/3], p — d — q. ■ 

To illustrate Theorem Q3 seed ####-## is optimal among all seeds of weight 6 with one 
joker. This means that this seed solves the (m, 2)-problem for all m > 16 and this is the smallest 
possible bound over all seeds of this class. Similarly, this seed solves the (m, 3)-problem for all 
m > 20, which is the best possible bound, etc. 

B. Regular expansion and contraction of seeds 

We now show that seeds solving larger problems can be obtained from seeds solving smaller 
problems, and vice versa, using regular expansion and regular contraction operations. 

Given a seed Q , its i-regular expansion i <g> Q is obtained by multiplying each matching 
position by i. This is equivalent to inserting % — 1 jokers between every two successive positions 
along the seed. For example, if Q = {0,2,3,5} (or #-##-#), then the 2-regular expansion of 

Q is 2®Q = {0,4,6, 10} (or # #-# #). Given a family F, its i-regular expansion i®F 

is the family obtained by applying the ?-regular expansion on each seed of F. 
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Lemma 1: If a family F solves an (m, k) -problem, then the (im, (i + l)k — l)-problem is 
solved both by family F and by its ^-regular expansion F = z ® F. 

Proof: Consider an (im, (i + l)k — l)-similarity w. By the pigeon hole principle, it contains 
at least one substring of length m with k mismatches or less, and therefore F solves the (im, (i + 
l)k — 1) -problem. On the other hand, consider i disjoint subsequences of w each one consisting 
of m positions equal modulo i. Again, by the pigeon hole principle, at least one of them contains 
k mismatches or less, and therefore the (im, (i + l)k — l)-problem is solved by i ® F. ■ 

The following lemma is the inverse of LemmaCQ it states that if seeds solving a bigger problem 
have a regular structure, then a solution for a smaller problem can be obtained by the regular 
contraction operation, inverse to the regular expansion. 

Lemma 2: If a family Fj = i® F solves an (im, /c)-problem, then F solves both the (im, k)- 
problem and the (m, \_k/i\ ) -problem. 

Proof: One can even show that F solves the (im, A;)-problem with the additional restriction 
for F to match inside one of the position intervals [l..m], [m + 1..2m], . . . , [(i — l)m + l.Am]. 
This is done by using the bijective mapping from Lemma [TJ given an (im, A;) -similarity w, 
consider i disjoint subsequences Wj (0 < j < i — 1) of w obtained by picking m positions equal 
to j modulo i, and then consider the concatenation w' = W1W2 ■ ■ ■ Wi-ivo®. 

For every (im, /c)-similarity w', its inverse image w is detected by Ft, and therefore F detects 
w' at one of the intervals [l..m], [m + I. .2m], ...,[(« — i)m + l.Am]. Futhermore, for any 
(m, [k/i\ )- similarity v, consider w' = v l and its inverse image w. As w' is detected by Fj, v is 
detected by F. ■ 

Example 1: To illustrate the two lemmas above, we give the following example pointed out 
in Q~|. The following two seeds are the only seeds of weight 12 that solve the (50, 5) -problem: 

#-#-# — # #-#-# — # #-#-# — # and ###-# — ###-# — ###-#. The first 

one is the 2-regular expansion of the second. The second one is the only seed of weight 12 that 
solves the (25, 2)-problem. 

The regular expansion allows, in some cases, to obtain an efficient solution for a larger problem 
by reducing it to a smaller problem for which an optimal or a near-optimal solution is known. 



13 

C. Periodic seeds 

In this section, we study seeds with a periodic structure that can be obtained by iterating 
a smaller seed. Such seeds often turn out to be among maximally weighted seeds solving a 
given (m, A;)-problem. Interestingly, this contrasts with the lossy framework where optimal seeds 
usually have a "random" irregular structure. 

Consider two seeds Qi,Q 2 represented as words over {#,-}. In this section, we lift the 
assumption that a seed must start and end with a matching position. We denote [Qi,Q2] 1 the 
seed defined as (QiQ 2 )" L Qi. For example, [###-# ,—] 2 = ###-# — ###-# — ###-#. 

We also need a modification of the (m, A;)-problem, where (m, A;) -similarities are considered 
modulo a cyclic permutation. We say that a seed family F solves a cyclic (m, k)-problem, if for 
every (m, A;) -similarity w, F detects one of cyclic permutations of w. Trivially, if F solves an 
(m, /c)-problem, it also solves the cyclic (m, A;)-problem. To distinguish from a cyclic problem, 
we call sometimes an (m, A;)-problem a linear problem. 

We first restrict ourselves to the single-seed case. The following lemma demonstrates that 
iterating smaller seeds solving a cyclic problem allows to obtain a solution for bigger problems, 
for the same number of mismatches. 

Lemma 3: If a seed Q solves a cyclic (m, A;)-problem, then for every % > 0, the seed Qi = 
[Q, — ( m-a w))] 1 solves the linear (m ■ (i + 1) + s(Q) — 1, A;)-problem. If i ^ 0, the inverse holds 
too. 

Proof: Consider an (m-(i + 1) +s(Q) — 1, fc) -similarity u. Transform u into a similarity 
v! for the cyclic (m, k) -problem as follows. For each mismatch position t of u, set at position 
(£ mod m) in v! . The other positions of u' are set to 1. Clearly, there are at most k 0's in u. 
As Q solves the (m, A;) -cyclic problem, we can find at least one position j, 1 < j < m, such 
that Q detects u' cyclicly. 

We show now that Qi matches at position j of u (which is a valid position as 1 < j < m and 
s(Qi) = im + s(Q)). As the positions of 1 in u are projected modulo m to matching positions 
of Q, then there is no under any matching element of Qi, and thus Qi detects u. 

<= Consider a seed Qi = [Q, -(' m - s ( ( 3))] i solving the (m • (i + 1) + s(Q) - 1, A;)-problem. As 
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i > 0, consider (m ■ (i + 1) + s(Q) — 1, /^-similarities having all their mismatches located inside 
the interval [m, 2m — 1]. For each such similarity, there exists a position j, 1 < j < m, such that 
Qi detects it. Note that the span of Qi is at least m + s(Q), which implies that there is either 
an entire occurrence of Q inside the window [m, 2m — 1], or a prefix of Q matching a suffix of 
the window and the complementary suffix of Q matching a prefix of the window. This implies 
that Q solves the cyclic (m, A;) -problem. ■ 
Example 2: Observe that the seed ###-# solves the cyclic (7, 2)-problem. From Lemma [3l 
this implies that for every % > 0, the (11 + 7i, 2)-problem is solved by the seed [###-#, — ]' of 
span 5 + 1%. Moreover, for 2 = 1,2, 3, this seed is optimal (maximally weighted) over all seeds 
solving the problem. 

By a similar argument based on Lemma [3l the periodic seed [#####-##, \ solves the 

(18 + Hi, 2) -problem. Note that its weight grows as ^-m compared to |m for the seed from 
the previous paragraph. However, when m — > oo, this is not an asymptotically optimal bound, 
as we will see later. 

The (18 + lli, 3)-problem is solved by the seed (###-# — #, )\ as seed ###-# — # 

solves the cyclic (11, 3)-problem. For i — 1, 2, the former is a maximally weighted seed among 
all solving the (18 + Hi, 3) -problem. 

One question raised by these examples is whether iterating some seed could provide an 
asymptotically optimal solution, i.e. a seed of maximal asymptotic weight. The following theorem 
establishes a tight asymptotic bound on the weight of an optimal seed, for a fixed number of 
mismatches. It gives a negative answer to this question, as it shows that the maximal weight 
grows faster than any linear fraction of the similarity size. 

Theorem 2: Consider a constant k. Let w(m) be the maximal weight of a seed solving the 
cyclic (m, /c)-problem. Then (m — w{m)) = Q(m~k~). 

Proof: Note first that all seeds solving cyclic a (m, k) -problem can be considered as seeds 
of span m. The number of jokers in any seed Q is then n = m — w(Q). The theorem states that 
the minimal number of jokers of a seed solving the (m, /c)-problem is 9(m i fc i ) for every fixed 
k. 
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Lower bound Consider a cyclic (m, /c)-problem. The number D(m, k) of distinct cyclic (m, k)- 
similarities satisfies 



(?) 



??? 



< D(m, fc), 



(7) 



as every linear (m, fc)-similarity has at most m cyclicly equivalent ones. Consider a seed Q. Let 
n be the number of jokers in Q and Jg(m, fc) the number of distinct cyclic (m, A;) -similarities 
detected by Q. Observe that jQ(m, k) < (?) and if Q solves the cyclic (m, A;)-problem, then 



From © and ©, we have 



D(m, k) = J Q (m, k) < 



m 



Using the Stirling formula, this gives n{k) = VL{m 

B 

step 1 



step i 



fc-i 
k 



step k 




(8) 



(9) 



Q ##-## ##-##' 



Fig. 1 

Construction of seeds Qi from the proof of Theorem[2] Jokers are represented in white and matching 

POSITIONS IN BLACK. 



Upper bound To prove the upper bound, we construct a seed Q that has no more then k ■ m~k~ 
joker positions and solves the cyclic (m, &)-problem. 

We start with the seed Q of span m with all matching positions, and introduce jokers into it 
in k steps. After step i, the obtained seed is denoted Qi, and Q = Q k . 
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Let B = |~m£~|. Qi is obtained by introducing into Q individual jokers with periodicity B 

by placing jokers at positions 1, B + 1, 2B + 1, . . .. At step 2, we introduce into Qi contiguous 

intervals of jokers of length B with periodicity B 2 , such that jokers are placed at positions 

[1 . . . B], [B 2 + 1 . . . B 2 + B], [2B 2 + 1 . . . IB 2 + B], . . .. 
In general, at step i (i < k), we introduce into Qi intervals of B' l ~ l jokers with periodicity 

B i at positions [1 . . . B*" 1 ], [B* + 1 . . . B i + B*' 1 ], . . . (see Figure ID- 
Note that Qi is periodic with periodicity B\ Note also that at each step i, we introduce at 

most [wi 1- *] intervals of B l ~ x jokers. Moreover, due to overlaps with already added jokers, 

each interval adds [B — new jokers. 

This implies that the total number of jokers added at step i is at most m 1_ £ ■ (B — l) 1 ^ 1 < 

m l ~^ ■ m^'^~ 1 ' ) = m^TT . Thus, the total number of jokers in Q is less than k ■ . 

By induction on i, we prove that for any (m, z)-similarity u (i < k), Qi detects u cyclicly, that 

is there is a cyclic shift of Qi such that all i mismatches of u are covered with jokers introduced 

at steps 1, . . . , i. 

For i=l, the statement is obvious, as we can always cover the single mismatch by shifting 
Qi by at most (B — 1) positions. Assuming that the statement holds for (i — 1), we show now 
that it holds for i too. Consider an (m, z)-similarity u. Select one mismatch of u. By induction 
hypothesis, the other (i — 1) mismatches can be covered by Qi-i- Since Qi-i has period B 1 ^ 1 
and Qi differs from Qi-i by having at least one contiguous interval of B l ~ l jokers, we can 
always shift Qi by j ■ B 1 ^ 1 positions such that the selected mismatch falls into this interval. This 
shows that Qi detects u. We conclude that Q solves the cyclic (m, i)-problem. ■ 

Using Theorem [2l we obtain the following bound on the number of jokers for the linear 
(m, £;) -problem. 

Lemma 4: Consider a constant k. Let w(m) be the maximal weight of a seed solving the 
linear (m, /c)-problem. Then (m — w(m)) = OfmH 1 ). 

Proof: To prove the upper bound, we construct a seed Q that solves the linear (m, k)- 
problem and satisfies the asymptotic bound. Consider some I < m that will be defined later, 
and let P be a seed that solves the cyclic (I, A;)-problem. Without loss of generality, we assume 
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S (P) = I. 

For a real number e > 1, define P e to be the maximally weighted seed of span at most l e of 
the form P' ■ P ■ ■ ■ P ■ P" , where P' and P" are respectively a suffix and a prefix of P. Due to 
the condition of maximal weight, w(P e ) > e ■ w(P). 

We now set Q = P e for some real e to be defined. Observe that if e • I < m — I, then Q solves 
the linear (m, fc)-problem. Therefore, we set e = ^f^. 

fc — 1 

From the proof of Theorem [21 we have I — w{P) < k ■ . We then have 

w{Q) = e-w{P)> r ^^-{l-k-l h ^r). (10) 

If we set 

/ = m fc +i, (11) 

we obtain 

h h — 1 

m — w(Q) < (k + 1)771^+1 — km (12) 

and as A; is constant, 

m-w(Q)=0(m&). (13) 

The lower bound is obtained similarly to Theorem [2l Let Q be a seed solving a linear (m, k)- 
problem, and let n = m — w(Q). From simple combinatorial considerations, we have 

(:)<:)■<»-<«»*(:)- 

which implies n = f2(mfe+r) for constant fc. ■ 

The following simple lemma is also useful for constructing efficient seeds. 

Lemma 5: Assume that a family F solves an (m, fc)-problem. Let F' be the family obtained 
from F by cutting out I characters from the left and r characters from the right of each seed of 
F. Then F' solves the (m — r — I, fc) -problem. 

Example 3: The (9 + 7i, 2)-problem is solved by the seed [###, -#--]' which is optimal for 
i = l,2, 3. Using Lemma[51 this seed can be immediately obtained from the seed [###-#, --]' 
from Example [2l solving the (11 + 7i, 2)-problem. 

We now apply the above results for the single seed case to the case of multiple seeds. 
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For a seed Q considered as a word over {#,-}, we denote by Qu] its cyclic shift to the left by 
% characters. For example, if Q = ####-#-##—, then Q [5] = #-## — ####-. The following 
lemma gives a way to construct seed families solving bigger problems from an individual seed 
solving a smaller cyclic problem. 

Lemma 6: Assume that a seed Q solves a cyclic (m, /c)-problem and assume that s(Q) = m 
(otherwise we pad Q on the right with (m — s(Q)) jokers). Fix some i > 1. For some L > 0, 
consider a list of L integers < j± < ■ ■ • < jz < m, and define a family of seeds F =< 
\\(Q\ji]Y\\ > t=v wn erc ||(Q[^]) l || stands for the seed obtained from (Q[j t ]) 1 by deleting the joker 
characters at the left and right edges. Define 5(1) = ((ji-i — ji) mod m) (or, alternatively, 
5(1) = ((ji ~ Ji-i) mod m )) for all /, 1 < / < L. Let m' = max{s(\\(Q m y\\) + 5(l)}f =1 - 1. 
Then F solves the (m', k) -problem. 

Proof: The proof is an extension of the proof of Lemma [3] Here, the seeds of the family 
are constructed in such a way that for any instance of the linear (m', /c)-problem, there exists at 
least one seed that satisfies the property required in the proof of Lemma [3] and therefore matches 
this instance. ■ 

In applying Lemma [6l integers j l are chosen from the interval [0, m) in such a way that values 
s (\ \(Q[ji\y\ I) + <K0 are closed to each other. We illustrate Lemma [6] with two examples that 
follow. 

Example 4: Let m = 11, k — 2. Consider the seed Q = ####-#-##-- solving the cyclic 
(11, 2)-problem. Choose % — 2, L — 2, j x — 0, j 2 = 5. This gives two seeds Q x = ||(Q[o]) 2 || = 
####-#-## — ####-#-## and Q 2 = \\(Q [5] ) 2 \\ = #-## — ####-#-## — #### of span 20 
and 21 respectively, 5(1) = 6 and 5(2) = 5. max{20 + 6, 21 + 5} — 1 = 25. Therefore, family 
F — {Q11Q2} solves the (25, 2)-problem. 

Example 5: Let m = 11, k = 3. The seed Q = ###-#--# solving the cyclic (11,3)- 

problem. Choose % — 2, L — 2, ji = 0, j 2 = 4. The two seeds are Qi = ||(<3[o]) 2 || = 
###-# — # — ###-# — # (span 19) and Q 2 = UQ^) 2 ]] = #—#—###-#—#—### 
(span 21), with 5(1) = 7 and 5(2) = 4. max{19 + 7,21 + 4} - 1 = 25. Therefore, family 
F = {Qi,Q2} solves the (25, 3)-problem. 
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D. Heuristic seed design 

Results of Sections [IV-AHIV-CI allow one to construct efficient seed families in certain cases, but 
still do not allow a systematic seed design. Recently, linear programming approaches to designing 
efficient seed families were proposed in |fT9l and in [fT8l , respectively for DNA and protein 
similarity search. However, neither of these methods aims at constructing lossless families. 

In this section, we outline a heuristic genetic programming algorithm for designing lossless 
seed families. The algorithm will be used in the experimental part of this work, that we present in 
the next section. Note that this algorithm uses dynamic programming algorithms of Section Unl 
Since the algorithm uses standard genetic programming techniques, we give only a high-level 
description here without going into all details. 

The algorithm tries to iteratively improve characteristics of a population of seed families 
until it finds a small family that detects all (m, /^-similarities (i.e. is lossless). The first step of 
each iteration is based on screening current families against a set of difficult similarities that 
are similarities that have been detected by fewer families. This set is continually reordered and 
updated according to the number of families that don't detect those similarities. For this, each 
set is stored in a tree and the reordering is done using the list-as-a-tree principle EUl : each time 
a similarity is not detected by a family, it is moved towards the root of the tree such that its 
height is divided by two. 

For those families that pass through the screening, the number of undetected similarities is 
computed by the dynamic programming algorithm of Section IIII-Bl The family is kept if it 
produces a smaller number than the families currently known. An undetected similarity obtained 
during this computation is added as a leaf to the tree of difficult similarities. 

To detect seeds to be improved inside a family, we compute the contribution of each seed by 
the dynamic programming algorithm of Section MI- CI The seeds with the least contribution are 
then modified with a higher probability. In general, the population of seed families is evolving 
by mutating and crossing over according to the set of similarities they do not detect. Moreover, 
random seed families are regularly injected into the population in order to avoid local optima. 

The described heuristic procedure often allows efficient or even optimal solutions to be 



computed in a reasonable time. For example, in ten runs of the algorithm, we found 3 of the 6 
existing families of two seeds of weight 14 solving the (25, 2)-problem. The whole computation 
took less than 1 hour, compared to a week of computation needed to exhaustively test all seed 
pairs. Note that the randomized-greedy approach (incremental completion of the seed set by 
adding the best random seed) applied a dozen of times to the same problem yielded only sets 
of three and sometimes four, but never two seeds, taking about 1 hour at each run. 

V. Experiments 

We describe two groups of experiments that we made. The first one concerns the design 
of efficient seed families, and the second one applies a multi-seed lossless filtration to the 
identification of unique oligos in a large set of EST sequences. 

Seed design experiments 

TABLE I 



Seed families for (25, 2)-problem 



size 


weight 


family seeds 


8 






1 


-|2 e 'P>9 


###-#—###-#—###-# 


5.96 




8 


2 


-|4e>P,9 


####-#-##—####-#-## 


7.47 


io- 


9 


3 


15'' 


#-##—####-#-##—#### 
#—##-#-######—##-#-## 


2.80 


io- 


9 






#-######—##-#-##### 

####—##-#-######—## 








4 




###-##-#-###—####### 


9.42 


io- 


10 






##-#-###—#######-##-# 

###—#######-##-#-### 












#######-##-#-###—### 








6 


17'' 


##-#-##—#######-####-# 


3.51 


io- 


10 






#-##—#######-####-#-## 

#######-####-#-##—### 












###-####-#-##—####### 












####-#-##—#######-### 












##—#######-####-#-##—# 









We considered several (m, k) -problems. For each problem, and for a fixed number of seeds 
in the family, we computed families solving the problem and realizing the largest possible seed 
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TABLE II 



Seed families for (25, 3)-problem 



size 


weight 


family seeds 










<5 






1 


8 =.p.s 


###-# ###-# 










1.53 


10" 




2 


10 p 


— # ## 










1.91 


10" 


e 


3 


11P 


## — # ### 

#—####-#-##—#— 


#-#-#■ 
-## 


: 






7.16 


10" 


7 






###-#-##—#— 
##--#— 


-#### 
-####- 


-#" 


-##" 


"# 








4 




#—####-#-##—#— 


-### 








2.39 


10" 


7 






###-#-##—#— 
#-##—#— 


-####- 
-####- 


"# 
-#" 


"##- 


"# 












##— #— 


-####- 


-#" 


"##- 


-# — 


--# 







weight (under a natural assumption that all seeds in a family have the same weight). We also 
kept track of the ways (periodic seeds, genetic programming heuristics, exhaustive search) in 
which those families can be computed. 

Tables U and UH summarize some results obtained for the (25, 2)-problem and the (25, 3)- 
problem respectively. Families of periodic seeds (that can be found using Lemma [6]) are marked 
with p , those that are found using a genetic algorithm are marked with g , and those which are 
obtained by an exhaustive search are marked with e . Only in this latter case, the families are 
guaranteed to be optimal. Families of periodic seeds are shifted according to their construction 
(see Lemma [6]). 

Moreover, to compare the selectivity of different families solving a given (m, /c)-problem, we 
estimated the probability 5 for at least one of the seeds of the family to match at a given position 
of a uniform Bernoulli four-letter sequence. This has been done using the inclusion-exclusion 
formula. 

Note that the simple fact of passing from a single seed to a two-seed family results in a 
considerable gain in efficiency: in both examples shown in the tables there a change of about 
one order magnitude in the selectivity estimator 5. 
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Oligo selection using multi-seed filtering 

An important practical application of lossless filtration is the selection of reliable oligonu- 
cleotides for DNA micro-array experiments. Oligonucleotides (oligos) are small DNA sequences 
of fixed size (usually ranging from 10 to 50) designed to hybridize only with a specific region 
of the genome sequence. In micro-array experiments, oligos are expected to match ESTs that 
stem from a given gene and not to match those of other genes. As the first approximation, the 
problem of oligo selection can then be formulated as the search for strings of a fixed length 
that occur in a given sequence but do not occur, within a specified distance, in other sequences 
of a given (possibly very large) sample. Different approaches to this problem apply different 
distance measures and different algorithmic techniques 112111 . E2l . [|23l , 11241 . The experiments we 
briefly present here demonstrate that the multi-seed filtering provides an efficient computation of 
candidate oligonucleotides. These should then be further processed by complementary methods 
in order to take into account other physico-chemical factors occurring in hybridisation, such as 
the melting temperature or the possible hairpin structure of palindromic oligos. 

Here we adopt the formalization of the oligo selection problem as the problem of identifying 
in a given sequence (or a sequence database) all substrings of length m that have no occurrences 
elsewhere in the sequence within the Hamming distance k. The parameters m and k were set 
to 32 and 5 respectively. For the (32, 5)-problem, different seed families were designed and 
their selectivity was estimated. Those are summarized in the table in Figure [21 using the same 
conventions as in Tables U and [n] above. The family composed of 6 seeds of weight 1 1 was 
selected for the filtration experiment (shown in Figure [2]) • 

The filtering has been applied to a database of rice EST sequences composed of 100015 
sequences for a total length of 42,845,242 bp Q. Substrings matching other substrings with 5 
substitution errors or less were computed. The computation took slightly more than one hour 
on a Pentium™4 3GHz computer. Before applying the filtering using the family for the (32, 5)- 
problem, we made a rough pre-filtering using one spaced seed of weight 16 to detect, with a high 
selectivity, almost identical regions. 65% of the database has been discarded by this pre-filtering. 

'source : http : / /bio server .myongji.ac. kr/ricemac .html, The Korea Rice Genome Database 



23 



family size 


weight 
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1 


7 e 


6.10 


■ 10" 
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8 e 


3.05 


■ 10" 


-5 
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9 e 


1.14 


■ 10" 


-5 


4 


109 


3.81 


■ 10" 


-6 


6 


119 


1.43 


■ 10" 


-6 


10 


12" 


5.97 


■ 10" 


-7 



Fig. 

Computed seed families for the (32, 5)-problem 



{ ####—# #—#—#### , 

###--#--## #-#### , 

#### #—#—##-### , 

###-#-#-—##—#### , 
###-##-##--#-#-## , 
####-##-#-#### } 



AND THE CHOSEN FAMILY (6 SEEDS OF WEIGHT 1 1) 



Another 22% of the database has been filtered out using the chosen seed family, leaving the 
remaining 13% as oligo candidates. 

VI. Conclusion 

In this paper, we studied a lossless filtration method based on multi-seed families and demon- 
strated that it represents an improvement compared to the single-seed approach considered in Q]|. 
We showed how some important characteristics of seed families can be computed using the 
dynamic programming. We presented several combinatorial results that allow one to construct 
efficient families composed of seeds with a periodic structure. Finally, we described a large- 
scale computational experiment of designing reliable oligonucleotides for DNA micro-arrays. 
The obtained experimental results provided evidence of the applicability and efficiency of the 
whole method. 

The results of Sections HV-AHIV-Cl establish several combinatorial properties of seed families, 
but many more of them remain to be elucidated. The structure of optimal or near-optimal 
seed families can be reduced to number-theoretic questions, but this relation remains to be 
clearly established. In general, constructing an algorithm to systematically design seed families 
with quality guarantee remains an open problem. Some complexity issues remain open too: 
for example, what is the complexity of testing if a single seed is lossless for given m, kl 
Section [III] implies a time bound exponential on the number of jokers. Note that for multiple 
seeds, computing the number of detected similarities is NP-complete [[161 Section 3.1]. 
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Another direction is to consider different distance measures, especially the Levenstein distance, 
or at least to allow some restricted insertion/deletion errors. The method proposed in 11251 does 
not seem to be easily generalized to multi-seed families, and a further work is required to improve 
lossless filtering in this case. 
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